Big Data Coding (Chicago Redlining)¶
This assignment looks at the correlation between historic redlining in the city of Chicago, greenspaces in the city, and asthma prevalence.
# Import libraries
import os
import pathlib
import time
import warnings
import geopandas as gpd
import geoviews as gv
import holoviews as hv
import hvplot.pandas
import hvplot.xarray
import numpy as np
import pandas as pd
import pystac_client
import rioxarray as rxr
import rioxarray.merge as rxrmerge
import shapely
import xarray as xr
from cartopy import crs as ccrs
from scipy.ndimage import convolve
from sklearn.model_selection import KFold
from scipy.ndimage import label
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm
import zipfile
data_dir = os.path.join(
pathlib.Path.home(),
'earth-analytics',
'data',
'chicago-greenspace')
os.makedirs(data_dir, exist_ok=True)
warnings.simplefilter('ignore')
# Prevent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"
c:\Users\moenc\miniconda3\envs\earth-analytics-python\Lib\site-packages\dask\dataframe\__init__.py:42: FutureWarning: Dask dataframe query planning is disabled because dask-expr is not installed. You can install it with `pip install dask[dataframe]` or `conda install dask`. This will raise in a future version. warnings.warn(msg, FutureWarning)
# set up the census tract directory and path
ct_dir = os.path.join (
pathlib.Path.home(),
'EDA-Spring-2025',
'big-data-allenmoench',
'chicago-tract'
)
os.makedirs(ct_dir, exist_ok=True)
ct_path = os.path.join(ct_dir, 'chicago-path.shp')
# Download the census tracts (only once)
if not os.path.exists(ct_path):
ct_url = ("https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip")
ct_gdf = gpd.read_file(ct_url)
chi_tract_gdf = ct_gdf[ct_gdf.PlaceName=='Chicago']
chi_tract_gdf.to_file(ct_path, index=False)
# load census data
chi_tract_gdf = gpd.read_file(ct_path)
# Plot the census tract site with satellite imagery in the background
(
chi_tract_gdf
.to_crs(ccrs.Mercator())
.hvplot(
line_color='orange', fill_color=None,
crs=ccrs.Mercator(), tiles='EsriImagery',
frame_width=600)
)
Reflect and Respond¶
Simply from looking at the coarse map, there seem to be certain small subsections that have more green than others. According to the City of Chicago's website, there are a number of Open Space and Sustainability plans that have been implemented over time. Some of the greenspace visible in the ESRI tiles does seem to correspond with these plans. In particular, the areas outlined in the Chicago Sustainable Industries plan from 2011 are noticeably coincident with areas where more greenspace is visible. For example, the corridor near I-55, and South Deering (the furthest Southern part of Chicago) both have sustainability plans, and noticeable green areas visible in the ESRI tiles.
Source: City of Chicago Planning and Development, Open Space and Sustainability Plans. 2/1/2025, https://www.chicago.gov/city/en/depts/dcd/provdrs/planning_and_policydivision/svcs/open-space-and-sustainability-plans.html
Data Description and citation¶
- TIGER/line shapefiles, Chicago. The shapefile for the city of Chicago boundaries was downloaded from the US Census Bureau (census.gov) TIGER/line shapefiles.
United States Census Bureau, Geography Division, 2/1/2025, https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2024&layergroup=Census+Tracts
- CDC 500 Cities & Places dataset. The CDC Places database covers the entire United States, providing geospatial data for a variety of health conditions. The datsets are calculated using models which draw from several other databases including the Behavioral Risk Factor Surveillance System (BRFSS), Census Bureau population data, and American Community Survey. These databases are provided by several organizations includidng the Centers for Disease Control and Prevention (CDC), National Center for Chronic Disease Prevention and Health Promotion, and Division of Population Health.
PLACES: Local Data for Better Health, Census Tract Data 2024 release. Last updated August 23, 2024. Accessed 2/1/2025. https://data.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh/about_data
- National Agricultural Imagery Program (NAIP). NAIP is administrated by the USDA and FPAC-BC (Farm Production and Conservation Business Center). It's goal is to collect very high resolution (60 cm or better) imagery for the entire United States. NAIP usees four-band imagery containing red, green, blue, and near-infrared bands. To collect the imagery, the USDA employs contractors who use aircraft with precisely calibrated digital cameras. In some cases, sattelite imagery is used as well, and in some places even higher resolutions can be found. National Agriculture Imagery Program (NAIP) Information Sheet, April 2023
Nationa Agricultural Imagery Program (NAIP) GeoHub. Accessed 2/1/2025. https://naip-usdaonline.hub.arcgis.com/
Download census tracts and select your urban area¶
# # This is my attempt to see if I can get the same data as listed in the 'solutions' (didn't work due to the URL being incorrectly formatted)
# # define the info for the CDC download, specifying Chicago by using the FIPS number
# ch_as_url=(
# "https://data.cdc.gov/resource/eav7-hnsx.geojson",
# "?locationid=17")
# # Make a directory for the chicago asthma data
# ch_as_dir = os.path.join(ct_dir, 'chicago-asthma')
# os.makedirs(ch_as_dir, exist_ok=True)
# ch_as_path = os.path.join(ch_as_dir, '*.shp')
# # download CDC data only once
# if not os.path.exists(ch_as_path):
# ch_as_gdf = gpd.read_file(ch_as_url)
# ch_as_gdf.to_file(ch_as_path)
# # load data
# ch_as_gdf = gpd.read_file(ch_as_path)
# # check data (plot or print)
# ch_as_gdf.head()
# Set up a path for the asthma data
asthma_path = os.path.join(data_dir, 'asthma.csv')
# Download asthma data (only once)
if not os.path.exists(asthma_path):
asthma_url = (
"https://data.cdc.gov/resource/cwsq-ngmh.csv"
"?year=2022"
"&stateabbr=IL"
"&countyname=Cook"
"&measureid=CASTHMA"
"&$limit=1500"
)
asthma_df = (
pd.read_csv(asthma_url)
.rename(columns={
'data_value': 'asthma',
'low_confidence_limit': 'asthma_ci_low',
'high_confidence_limit': 'asthma_ci_high',
'locationname': 'tract'
})
[[
'year', 'tract',
'asthma', 'asthma_ci_low', 'asthma_ci_high', 'data_value_unit',
'totalpopulation', 'totalpop18plus'
]]
)
asthma_df.to_csv(asthma_path, index=False)
# Load in asthma data
asthma_df = pd.read_csv(asthma_path)
# Preview asthma data
asthma_df
| year | tract | asthma | asthma_ci_low | asthma_ci_high | data_value_unit | totalpopulation | totalpop18plus | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 17031031100 | 8.4 | 7.5 | 9.5 | % | 4691 | 4359 |
| 1 | 2022 | 17031031900 | 8.6 | 7.7 | 9.7 | % | 2522 | 2143 |
| 2 | 2022 | 17031062600 | 8.3 | 7.3 | 9.3 | % | 2477 | 1760 |
| 3 | 2022 | 17031070101 | 8.9 | 7.9 | 9.9 | % | 4171 | 3912 |
| 4 | 2022 | 17031081100 | 9.0 | 8.0 | 10.1 | % | 4187 | 3951 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1323 | 2022 | 17031834900 | 13.5 | 12.1 | 15.0 | % | 1952 | 1451 |
| 1324 | 2022 | 17031828601 | 9.8 | 8.8 | 11.0 | % | 4198 | 3227 |
| 1325 | 2022 | 17031843700 | 8.4 | 7.5 | 9.5 | % | 2544 | 1891 |
| 1326 | 2022 | 17031829700 | 10.8 | 9.6 | 12.1 | % | 3344 | 2524 |
| 1327 | 2022 | 17031829100 | 10.2 | 9.1 | 11.5 | % | 3512 | 2462 |
1328 rows × 8 columns
Join health data with census tract boundaries¶
# Change tract identifier datatype for merging
# Note: what is "tract 2010"?
chi_tract_gdf.tract2010 = chi_tract_gdf.tract2010.astype('int64')
# Merge census data with geometry
tract_cdc_gdf = (
chi_tract_gdf
.merge(asthma_df, left_on='tract2010', right_on='tract', how='inner')
)
# Plot asthma data as chloropleth
# Note: what is 'gv', 'gv.polygons', 'vdim', and '.opts'?
(
gv.tile_sources.EsriImagery
*
gv.Polygons(
tract_cdc_gdf.to_crs(ccrs.Mercator()),
vdims=['asthma', 'tract2010'],
crs=ccrs.Mercator()
).opts(colorbar=True, tools=['hover'])
).opts(width=600, height=600, xaxis=None, yaxis=None)
There are two main "clusters" in the city, where asthma is more prevalent. One is in the mid-Northern part of the city, just to the West of the city's center, and the other is in the South. If this data is compared with population and ethnicity data, it's immediately apparent that the areas with the greatest asthma prevalence coincide closely with the areas in which most people of color are living. The exact reason for high asthma incidence in these areas was not apparent from my research - presumably they are areas closer to industrial centers or highways. However my research did indicate that there is a history of racial discrimination in the city, some of which is "cemented" (in some cases quite literally) in place by redlining and other forms of systemic racism which relegate black and hispanic communities to poorer and more polluted areas of the city.

Source: wttw, South Side Weekly. "Firsthand Segregation: Mapping Chicago's Racial Segregation" (Jaqueline Serrato, Pat Sier, Charmaine Runes).
Get Data URLs¶
# Connect to the planetary computer catalog
e84_catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1"
)
e84_catalog.title
'Microsoft Planetary Computer STAC API'
# Convert geometry to lat/lon for STAC
# Question: what is STAC?
tract_latlon_gdf = tract_cdc_gdf.to_crs(4326)
# Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'chicago_ndvi_stats.csv')
# Check for existing data - do not access duplicate tracts
# Note: what does 'ndvi.tract.values' do?
downloaded_tracts = []
if os.path.exists(ndvi_stats_path):
ndvi_stats_df = pd.read_csv(ndvi_stats_path)
downloaded_tracts = ndvi_stats_df.tract.values
else:
print('no census tracts downloaded so far')
# Loop through each census tract
# Note: what is 'i', what is 'tqdm' and what is .iterrows()?
scene_dfs = []
for i, tract_values in tqdm(tract_latlon_gdf.iterrows()):
tract = tract_values.tract2010
# Check if statistics are already downloaded for this tract
if not (tract in downloaded_tracts):
# Repeat up to 5 times in case of a momentary disruption
i = 0
retry_limit = 5
while i < retry_limit:
# Try accessing the STAC
try:
# Search for tiles
# what is "intersects=shapely.to_geojson(tract_values.geometry)"? I understand .to_geojson, but not intersects=shapely. presumably .geometry converts tract values to geometry...
naip_search = e84_catalog.search(
collections=["naip"],
intersects=shapely.to_geojson(tract_values.geometry),
datetime="2021"
)
# Build dataframe with tracts and tile urls
# I understand this is appending the "scene df's" to the list we made previously...
# presumably the code is querying naip for different sets of data that are defined by different 'scenes' and or dates.
# I don't understand the last line here: "rgbir_href=[scene.assets['image'].href for scene in naip_search.it]". What is this doing?
scene_dfs.append(pd.DataFrame(dict(
tract=tract,
date=[pd.to_datetime(scene.datetime).date()
for scene in naip_search.items()],
rgbir_href=[scene.assets['image'].href for scene in naip_search.items()],
)))
break
# Try again in case of an APIError
except pystac_client.exceptions.APIError:
print(
f'Could not connect with STAC server. '
f'Retrying tract {tract}...')
time.sleep(2)
i += 1
continue
# Concatenate the url dataframes
if scene_dfs:
scene_df = pd.concat(scene_dfs).reset_index(drop=True)
else:
scene_df = None
print("Error: Scene_df = None")
# Preview the URL DataFrame
scene_df
0it [00:00, ?it/s]
Error: Scene_df = None
Compute NDVI Statistics¶
Next, calculate some metrics to get at different aspects of the
distribution of greenspace in each census tract. These types of
statistics are called fragmentation statistics, and can be
implemented with the scipy package. Some examples of fragmentation
statistics are:
Percentage vegetation
The percentage of pixels that exceed a vegetation threshold (.12 is
common with Landsat) Mean patch size
The average size of patches, or contiguous area exceeding the
vegetation threshold. Patches can be identified with the label
function from scipy.ndimage Edge density
The proportion of edge pixels among vegetated pixels. Edges can be identified by convolving the image with a kernel designed to detect pixels that are different from their surroundings.
What is convolution?
If you are familiar with differential equations, convolution is an approximation of the LaPlace transform.
For the purposes of calculating edge density, convolution means that we are taking all the possible 3x3 chunks for our image, and multiplying it by the kernel:
$$ \text{Kernel} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$
The result is a matrix the same size as the original, minus the outermost edge. If the center pixel is the same as the surroundings, its value in the final matrix will be $-8 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 0$. If it is higher than the surroundings, the result will be negative, and if it is lower than the surroundings, the result will be positive. As such, the edge pixels of our patches will be negative.
- Select a single row from the census tract
GeoDataFrameusing e.g.Â.loc[[0]], then select all the rows from your URLDataFramethat match the census tract. - For each URL in crop, merge, clip, and compute NDVI for that census tract
- Set a threshold to get a binary mask of vegetation
- Using the sample code to compute the fragmentation statistics. Feel free to add any other statistics you think are relevant, but make sure to include a fraction vegetation, mean patch size, and edge density. If you are not sure what any line of code is doing, make a plot or print something to find out! You can also ask ChatGPT or the class.
chi_tract_gdf
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | |
|---|---|---|---|---|---|---|---|
| 0 | 1714000 | 17031010100 | 17 | Chicago | 1714000-17031010100 | 4854 | POLYGON ((-9758835.381 5164429.383, -9758837.3... |
| 1 | 1714000 | 17031010201 | 17 | Chicago | 1714000-17031010201 | 6450 | POLYGON ((-9760143.496 5163888.741, -9760143.4... |
| 2 | 1714000 | 17031010202 | 17 | Chicago | 1714000-17031010202 | 2818 | POLYGON ((-9759754.212 5163883.196, -9759726.6... |
| 3 | 1714000 | 17031010300 | 17 | Chicago | 1714000-17031010300 | 6236 | POLYGON ((-9758695.229 5163870.91, -9758695.78... |
| 4 | 1714000 | 17031010400 | 17 | Chicago | 1714000-17031010400 | 5042 | POLYGON ((-9757724.634 5160715.939, -9757742.2... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 804 | 1714000 | 17031835700 | 17 | Chicago | 1714000-17031835700 | 0 | POLYGON ((-9754496.811 5132719.307, -9754491.8... |
| 805 | 1714000 | 17031980000 | 17 | Chicago | 1714000-17031980000 | 0 | POLYGON ((-9788665.342 5161808.277, -9788667.6... |
| 806 | 1714000 | 17031980100 | 17 | Chicago | 1714000-17031980100 | 0 | POLYGON ((-9766922.964 5128945.613, -9766938.5... |
| 807 | 1714000 | 17043840000 | 17 | Chicago | 1714000-17043840000 | 0 | POLYGON ((-9787210.28 5154713.902, -9787240.67... |
| 808 | 1714000 | 17043840801 | 17 | Chicago | 1714000-17043840801 | 0 | POLYGON ((-9787431.695 5154736.805, -9787441.0... |
809 rows × 7 columns
## Note: this cell may fail to run if tracts are alreadu downloaded (scene_df is empty). If this is the case, proceed and run the next cell.
# Find matching tracts
matching_tracts = chi_tract_gdf[chi_tract_gdf.tract2010.isin(scene_df.tract)]
# Print the index values of matching rows
print("Matching indices in chi_tract_gdf:", matching_tracts.index.tolist())
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In[17], line 4 1 ## Note: this cell may fail to run if tracts are alreadu downloaded (scene_df is empty). If this is the case, proceed and run the next cell. 2 3 # Find matching tracts ----> 4 matching_tracts = chi_tract_gdf[chi_tract_gdf.tract2010.isin(scene_df.tract)] 6 # Print the index values of matching rows 7 print("Matching indices in chi_tract_gdf:", matching_tracts.index.tolist()) AttributeError: 'NoneType' object has no attribute 'tract'
# Skip this step if data are already downloaded
if not scene_df is None:
# Get an example tract
tract = chi_tract_gdf.loc[417].tract2010
ex_scene_gdf = scene_df[scene_df.tract==tract]
# Loop through all images for tract
tile_das = []
for _, href_s in ex_scene_gdf.iterrows():
# Open vsi connection to data
tile_da = rxr.open_rasterio(
href_s.rgbir_href, masked=True).squeeze()
# Crop data to the bounding box of the census tract
boundary = (
tract_cdc_gdf
.set_index('tract2010')
.loc[[tract]]
.to_crs(tile_da.rio.crs)
.geometry
)
crop_da = tile_da.rio.clip_box(
*boundary.envelope.total_bounds,
auto_expand=True)
# Clip data to the boundary of the census tract
clip_da = crop_da.rio.clip(boundary, all_touched=True)
# Compute NDVI
ndvi_da = (
(clip_da.sel(band=4) - clip_da.sel(band=1))
/ (clip_da.sel(band=4) + clip_da.sel(band=1))
)
# Accumulate result
tile_das.append(ndvi_da)
# Merge data
scene_da = rxrmerge.merge_arrays(tile_das)
# Mask vegetation
veg_mask = (scene_da>.3)
# Calculate mean patch size
labeled_patches, num_patches = label(veg_mask)
# Count patch pixels, ignoring background at patch 0
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()
# Calculate edge density
kernel = np.array([
[1, 1, 1],
[1, -8, 1],
[1, 1, 1]])
edges = convolve(veg_mask, kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size
# This code cell is to keep track of the objects I'm defining... If I forget what's contained in one of them, I can just uncomment and print it.
# tile_das
scene_df
# tract_cdc_gdf
# scene_da
# ex_scene_gdf
# Skip this step if data are already downloaded
if not scene_df is None:
ndvi_dfs = []
# Loop through the census tracts with URLs
# I learned what tqdm is, but I've already forgotten...
for tract, tract_date_gdf in tqdm (scene_df.groupby('tract')):
# Open all images for tract
# Note: what is the _ after 'for' holding space for? What is 'href_s', and I still don't know what iterrows does.
tile_das = []
for _, href_s in tract_date_gdf.iterrows():
# Open vsi connection to data
# what is rgbir?
tile_da = rxr.open_rasterio(
href_s.rgbir_href, masked=True).squeeze()
# Clip data
# what is .set_index?
boundary = (
tract_cdc_gdf
.set_index('tract2010')
.loc[[tract]]
.to_crs(tile_da.rio.crs)
.geometry
)
crop_da = tile_da.rio.clip_box(
*boundary.envelope.total_bounds,
auto_expand=True)
clip_da = crop_da.rio.clip(boundary, all_touched=True)
# Compute NDVI
ndvi_da = (
(clip_da.sel(band=4) - clip_da.sel(band=1))
/ (clip_da.sel(band=4) + clip_da.sel(band=1))
)
# Accumulate result
tile_das.append(ndvi_da)
# Merge data
scene_da = rxrmerge.merge_arrays(tile_das)
# Mask vegetation
veg_mask = (scene_da>.3)
# Calculate statistics and save data to file
total_pixels = scene_da.notnull().sum()
veg_pixels = veg_mask.sum()
# Calculate mean patch size
labeled_patches, num_patches = label(veg_mask)
# count patch pixels, ignoring background at patch 0
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()
# Calculate edge density
kernel = np.array([
[1, 1, 1],
[1, -8, 1],
[1, 1, 1, ]])
edges = convolve(veg_mask, kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size
# Add a row to the statistics file for this tract
pd.DataFrame(dict(
tract=[tract],
total_pixels=[int(total_pixels)],
frac_veg=[float(veg_pixels/total_pixels)],
mean_patch_size=[mean_patch_size],
edge_density=[edge_density]
)).to_csv(
ndvi_stats_path,
mode='a',
index=False,
header=(not os.path.exists(ndvi_stats_path))
)
# Re-load results from file
ndvi_stats_df = pd.read_csv(ndvi_stats_path)
ndvi_stats_df
| tract | total_pixels | frac_veg | mean_patch_size | edge_density | |
|---|---|---|---|---|---|
| 0 | 17031010100 | 1348088 | 0.140350 | 55.225919 | 0.118612 |
| 1 | 17031010201 | 1632112 | 0.200683 | 57.543394 | 0.161668 |
| 2 | 17031010202 | 1146600 | 0.158785 | 63.260250 | 0.123673 |
| 3 | 17031010300 | 1711710 | 0.146512 | 57.113642 | 0.126384 |
| 4 | 17031010400 | 3119840 | 0.096548 | 52.983817 | 0.079474 |
| ... | ... | ... | ... | ... | ... |
| 783 | 17031843500 | 6963674 | 0.061033 | 9.732779 | 0.104686 |
| 784 | 17031843600 | 1177024 | 0.052817 | 9.177296 | 0.101217 |
| 785 | 17031843700 | 7025535 | 0.023710 | 7.477533 | 0.047642 |
| 786 | 17031843800 | 3786240 | 0.090268 | 24.169295 | 0.105052 |
| 787 | 17031843900 | 8095350 | 0.111222 | 23.985215 | 0.124282 |
788 rows × 5 columns
# Merge census data with geometry
chi_ndvi_cdc_gdf = ( # Creates a new object, to store the merged data:
tract_cdc_gdf # this is the previously defined gdf containing the cdc data merged with the geometry.
.merge(
ndvi_stats_df,
left_on='tract2010', right_on='tract', how='inner') # this is defining the conditions of the join.
)
# Plot chloropleths with vegetation statistics
def plot_chloropleth(gdf, **opts): # This is defining a function. presumably plot_chloropleth is the name of the function, and gdf, **opts are the arguments.
"""Generate a chloropleth with the given color colummn""" # This is text explaining what the function does.
return gv.Polygons(
gdf.to_crs(ccrs.Mercator()), # projecting the gdf to a Mercator coordinate reference system
crs=ccrs.Mercator() # Not sure why this second projection function is necessary. Is this setting a different crs (different from "gdf" in the previous line) to Mercator?
).opts(xaxis=None, yaxis=None, colorbar=True, **opts) # Defining the properties of the chloropleth. I'm still not sure what **opts is doing, exactly.
(
plot_chloropleth(
chi_ndvi_cdc_gdf, color='asthma', cmap='viridis') # Defining the properties of the chloropleth. I think this is running the function we just defined, while inputting the arguments.
+
plot_chloropleth(chi_ndvi_cdc_gdf, color='edge_density', cmap='Greens')
)
chi_ndvi_cdc_gdf.head()
| place2010 | tract2010 | ST | PlaceName | plctract10 | PlcTrPop10 | geometry | year | tract_x | asthma | asthma_ci_low | asthma_ci_high | data_value_unit | totalpopulation | totalpop18plus | tract_y | total_pixels | frac_veg | mean_patch_size | edge_density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1714000 | 17031010100 | 17 | Chicago | 1714000-17031010100 | 4854 | POLYGON ((-9758835.381 5164429.383, -9758837.3... | 2022 | 17031010100 | 10.3 | 9.2 | 11.5 | % | 4905 | 4083 | 17031010100 | 1348088 | 0.140350 | 55.225919 | 0.118612 |
| 1 | 1714000 | 17031010201 | 17 | Chicago | 1714000-17031010201 | 6450 | POLYGON ((-9760143.496 5163888.741, -9760143.4... | 2022 | 17031010201 | 11.1 | 9.9 | 12.4 | % | 6939 | 5333 | 17031010201 | 1632112 | 0.200683 | 57.543394 | 0.161668 |
| 2 | 1714000 | 17031010202 | 17 | Chicago | 1714000-17031010202 | 2818 | POLYGON ((-9759754.212 5163883.196, -9759726.6... | 2022 | 17031010202 | 9.7 | 8.7 | 10.9 | % | 2742 | 2296 | 17031010202 | 1146600 | 0.158785 | 63.260250 | 0.123673 |
| 3 | 1714000 | 17031010300 | 17 | Chicago | 1714000-17031010300 | 6236 | POLYGON ((-9758695.229 5163870.91, -9758695.78... | 2022 | 17031010300 | 9.6 | 8.6 | 10.8 | % | 6305 | 5606 | 17031010300 | 1711710 | 0.146512 | 57.113642 | 0.126384 |
| 4 | 1714000 | 17031010400 | 17 | Chicago | 1714000-17031010400 | 5042 | POLYGON ((-9757724.634 5160715.939, -9757742.2... | 2022 | 17031010400 | 10.2 | 9.1 | 11.4 | % | 5079 | 4742 | 17031010400 | 3119840 | 0.096548 | 52.983817 | 0.079474 |
The patterns of asthma match the patterns observed previously, with the greatest prevalence of asthma in the West and South of Chicago. Edge denstiy decreases towards the center of the city, and increases further from the center. Interestingly, areas with low edge density appear to correspond with areas of low asthma prevalence. Edge density is a measurement of the continuity of vegetation - lower edge density means more contiguous and consistent vegetation cover, which generally indicates healthier ecosystems. It makes sense then, that areas with lower edge density correspond to areas with lower asthma, as areas with lower edge density would have healthier ecosystems that are therefore more effective at removing pollutants from the air.
It's difficult to tell from these plots, but it's possible they might correspond to some degree with some of Chicago's sustainability and urban greening initiatives. The areas near I-90 and I-55 both have low edge density, as well as a 2011 sustainability plan, although some of the other sustainability plans do not correlate with decreases in edge density:

Source: City of Chicago planning and development. Copyright © 2010 - 2025 City of Chicago. Accessed 2/5/2025. https://www.chicago.gov/city/en/depts/dcd/provdrs/planning_and_policydivision/svcs/open-space-and-sustainability-plans.html
Do you see any similarities in your plots? Do you think there is a relationship between adult asthma and any of your vegetation statistics in Chicago? Relate your visualization to the research you have done (the context of your analysis) if applicable.
Explore a linear ordinary least-squares regression
Model description¶
Model description and assumptions:
An ordinary least squares regression searches for a line of best fit through a series of data points. It does this by minimizing the sum of the squared differences between the data points and the line. The model has five assumptions: linearity, independence, homoscedasticity, non-multicollinearity, and normality. Linearity assumes a linear relationship between the dependent and independent variables, independence assumes that data points do not influence eachother, homoscedasticity assumes that error is more or less evenly distributed for the independent variable, non-mutlicollinearity assumes that variables aren't inherently too closely correlated, and normality of errors assumes that error is randomly distributed (roughtly following a normal distribution curve).
In the case of our model, asthma will be the dependent variable, and independent variables will be edge density and mean patch size. Judging from the scatter plots of the data printed below, edge density has a relatively linear correlation with asthma (after it's been transformed into log(asthma)), but mean patch size does not. Independence seems like a valid assumption, since one person having asthma doesn't influence whether another person nextdoor has asthma (unlike, for example, tuberculosis). Homoscedasticity may or may not be a valid assumption - this could be evaluated by plotting out the error between the model's prediction of the independent variables (edge density and mean patch size) and the observed value, but I considered that to be beyond the scope of the assignment for now.
In the case of non-multicollinearity, it's difficult to assess whether or not this is a reasonable assumption. If my understanding is correct, for example, edge density and asthma would be multicollinear if asthma increased in areas where edge density increased, but not becasue edge density is increasing. For example, vegetation might be more fragmented and asthma might be more prevalent in poorer areas, so the numbers would correlate but high edge density would not be necessarily causing asthma. This is difficult to assess. Finally, normality of error could be tested through creating a plot of the model error. This was not done (considered out of scope), so the validity of this assumption cannot be verified.
# Variable selection and transformation
model_df = ( # This section creates a new object (model_df) that contains only the columns listed, and removes any nan values.
chi_ndvi_cdc_gdf.copy()
[['frac_veg', 'asthma', 'mean_patch_size', 'edge_density', 'geometry']]
.dropna()
)
model_df['log_asthma'] = np.log(model_df.asthma) # This performs a log transform on the asthma data, and puts the result into a new column, 'log_asthma'.
# Plot scatter matrix to identify variables that need transformation
hvplot.scatter_matrix(
model_df
[[
'mean_patch_size',
'edge_density',
'log_asthma'
]]
)
# Select predictor and outcome variables
x = model_df[['edge_density', 'mean_patch_size']] # selecting edge density and mean patch size as the predictor, and log asthma as the outcome.
y= model_df[['log_asthma']]
# Split into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.33, random_state=42) # I don't really understand this part...
# Fit a linear regression
reg = LinearRegression()
reg.fit(x_train, y_train)
# Predict asthma values for the test dataset
y_test['pred_asthma'] = np.exp(reg.predict(x_test))
y_test['asthma'] = np.exp(y_test.log_asthma)
# Plot measured vs. predicted asthma prevalence with a 1-to-1 line
y_max = y_test.asthma.max()
(
y_test.hvplot.scatter(
x='asthma', y='pred_asthma',
xlabel='Measured Adult Asthma Prevalence',
ylabel='Predicted Adult Asthma Prevalence',
title='Linear Regression Performance - Testing Data'
)
.opts(aspect='equal',
xlim=(0, y_max), ylim=(0, y_max),
width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')
# Compute model error for all census tracts
model_df['pred_asthma'] = np.exp(reg.predict(x)) # not sure what "np.exp(reg.predict(x))" does
model_df['err_asthma'] = model_df['pred_asthma'] - model_df['asthma']
# Plot error geographically as a chloropleth
(
plot_chloropleth(model_df, color='err_asthma', cmap='RdBu')
.redim.range(err_asthma=(-.3, .3))
.opts(frame_width=600, aspect='equal')
)
Areas that had observed low asthma prevalence had positive error, meaning that predicted values were larger than observed values. Areas with high asthma prevalence had negative error, meaning that observed values were higher than predicted values. So, the model is under-predicting in areas with high asthma prevalence, and over-predicting in areas with low asthma prevalence.
Based on the scatter plot and the interactive plot, there's some spatial autocorrelation going on: areas that are near one another are showing similar patterns of error. Considering that observed values were below the line of best fit in areas with low asthma, and above the line of best fit in areas with high asthma, perhaps a different model could be used to better predict obseverd asthma values. For example, a model based on a logistic function would create an S-shaped curve, which might more closely fit the observed data.